The first portion (sections 1 to 6) of this document provides an overview of the background, motivation, and design of Project STAR, as well as the shortcomings of the study design; this material largely mirrors the initial analysis report.
This final report extends the initial analysis by investigating the long-term effects of small class sizes on student academic achievement. The subsequent sections (7 to 10) detail the procedures for addressing the new question of interest, including data preparation, variable selection, and statistical analysis.
Given the breadth of the data, many questions and caveats could be explored, but this analysis focuses primarily on the long-term effects of class size on student academic achievement. Related areas are acknowledged where relevant but are noted as beyond the scope of this analysis.
This study investigates the short and long-term effects of small class sizes during early education on student academic achievement using data from Tennessee’s Project STAR (Student/Teacher Achievement Ratio). Employing statistical analyses, we examine how participation in small, regular, or regular-aide classes in grades K–3 impacts students’ academic performance in early elementary years, subsequent grades 4–8, high school GPA, and standardized college readiness test scores (ACT/SAT). Initial findings confirm significant immediate benefits of smaller class sizes on early academic achievement. However, longitudinal analyses on 407 students with complete data (detailed in the subsequent sections) reveal inconsistent or negligible long-term effects beyond grade 3, suggesting that early educational advantages related to class size may diminish over time. This research highlights critical insights into the durability of educational interventions and informs future policy decisions on class size reduction.
Educational researchers and policymakers have long debated the impact of class size on student academic achievement. Project STAR, initiated by Tennessee in the mid-1980s, was a pioneering randomized controlled trial designed explicitly to explore this issue by assigning students to small classes (13–17 students), regular classes (22–25 students), or regular classes with a full-time teacher’s aide. This landmark study provided a unique opportunity to evaluate the immediate and long-term consequences of reduced class sizes from kindergarten through third grade and subsequently tracked student outcomes through high school.
This report builds upon previous analyses of Project STAR data, initially focused on immediate outcomes, by extending the scope to examine long-term academic impacts. Specifically, it evaluates the hypothesis that early exposure to small classes significantly affects students’ academic trajectories and readiness for higher education. The study rigorously addresses limitations such as student mobility, attrition, and potential confounding factors introduced by the operational adjustments within Project STAR. By systematically assessing academic performance across multiple stages of students’ educational careers, this analysis contributes to a nuanced understanding of whether early class-size reductions yield lasting academic benefits, thereby informing educational policy and future research directions.
Project STAR, short for the Student/Teacher Achievement Ratio Project, emerged in the mid-1980s as a pioneering effort to explore the relationship between class size and student academic outcomes. Motivated by policy concerns regarding the efficacy of smaller classes, the study was funded by the Tennessee General Assembly and implemented as a randomized controlled trial. In this experiment, students and teachers were randomly assigned to one of three classroom environments—small classes (13–17 students), regular classes (22–25 students), and regular classes with a teacher’s aide. This design was intended to isolate the effect of class size on academic performance while controlling for potential confounders, thereby providing strong evidence on the causal impacts of educational settings.
Beyond its initial focus on early childhood education, Project STAR was designed as a longitudinal study that followed students from kindergarten through third grade, and later into high school. This extended follow-up allowed researchers to investigate long-term outcomes, including high school achievement, graduation rates, and preparedness for higher education. The extensive dataset, which encompasses detailed academic records, teacher assessments, and demographic information, has been invaluable in shaping educational policy and research. By systematically analyzing the long-term effects of early educational interventions, Project STAR has contributed significantly to our understanding of how classroom environments influence academic trajectories and overall student success.
library(MASS)      # loaded before dplyr so that MASS::select() does not mask dplyr::select()
library(dplyr)
library(ggplot2)
library(ggthemes)
library(knitr)
library(kableExtra)
library(patchwork)
library(car)
library(broom)
library(AER)
library(foreign)
library(forcats)
library(tidyr)
library(ggalluvial)
library(plotly)
STAR_Students <- read.spss('dataverse_files/PROJECT STAR/STAR_Students.sav', to.data.frame=TRUE)
# Comparison_Students <- read.spss('dataverse_files/PROJECT STAR/Comparison_Students.sav', to.data.frame=TRUE)
STAR_K3_Schools <- read.spss('dataverse_files/PROJECT STAR/STAR_K-3_Schools.sav', to.data.frame=TRUE)
STAR_High_Schools <- read.spss('dataverse_files/PROJECT STAR/STAR_High_Schools.sav', to.data.frame=TRUE)
This investigation utilizes the STAR-and-Beyond database from the Harvard Dataverse, which contains detailed information on students, teachers, and schools involved in Project STAR. The dataset includes records from the original STAR study, as well as follow-up data from high school if available.
The primary student-level data file contains information on 11,601 students who participated in the experimental phase for at least one year between 1985 and 1989. Information for each of grades K-3 includes:
As part of the extended follow-up, the following were added to the records of some or all students:
Note: This investigation does not necessarily encompass all variables in the dataset, but rather focuses on key areas of interest related to class size and student achievement (discussed in the subsequent sections).
Project STAR was initiated following the passage of House Bill (HB) 544 by the Tennessee Legislature in May 1985, aimed at investigating the effects of class size on student achievement and development in primary grades (K–3). The legislation outlined three primary research questions:
To implement this study, the Tennessee State Department of Education established a research consortium involving representatives from the Department, State Board of Education, State Superintendents’ Association, and four Tennessee universities. The study adhered to an experimental design, randomly assigning students entering kindergarten in 1985 or first grade in 1986 to one of three class conditions:
Randomization was executed by consortium members and supervised locally by university-affiliated graduate students, ensuring unbiased assignment based on gender, race, and socioeconomic status.
All Tennessee schools were invited to participate under conditions set by the state, including the random assignment requirement, maintenance of standard school policies aside from class size adjustments, and commitment for four consecutive years. Of the initially interested 180 schools, 79 were ultimately selected from 42 districts to ensure representation of inner-city, suburban, urban, and rural settings:
Participation fluctuated slightly due to mergers and withdrawals, primarily attributed to challenges maintaining randomization and administrative burdens. Consequently, the number of participating schools ranged from 79 in kindergarten to 75 by third grade.
After the initial year, STAR administrators modified the study slightly by randomly redistributing half of the students between regular (R) and regular-aide (RA) classes for subsequent years, because no significant kindergarten performance differences were found between these two groups. Small-class assignments remained unchanged. This caveat is addressed in the subsequent analysis.
Teacher training occurred for a subset of second-grade teachers, with no significant difference in student achievement outcomes observed between trained and untrained teachers. Student mobility also influenced class composition, with new entrants randomly assigned while maintaining small-class constraints. This “class size drift” was documented and considered in subsequent analyses.
Academic performance was evaluated annually using the Stanford Achievement Tests (SATs) and the Tennessee Basic Skills First (BSF) tests. Student self-concept and motivation were measured using the SCAMIN inventory. Beyond third grade, additional longitudinal data were collected, including academic performance in grades 4–8 (via the Tennessee Comprehensive Assessment Program, TCAP), student participation and identification with school surveys, college entrance examination data (ACT/SAT), high school transcripts, and graduation/dropout information.
These detailed design features and rigorous methodologies positioned Project STAR as a landmark experimental study capable of robustly determining the causal impacts of class size on educational outcomes. However, the project did not come without limitations and challenges.
Despite its robust experimental design, Project STAR has several notable limitations that should be acknowledged when interpreting its findings:
Project STAR experienced considerable student mobility, resulting in many students not remaining in their assigned class types throughout the study period. Such mobility led to a phenomenon known as “class size drift,” where the actual sizes of regular classes sometimes became similar to those of small classes, potentially diluting the experimental contrast and complicating causal inference.
The purposeful selection of schools, which aimed to cover diverse geographic and socioeconomic areas within Tennessee, might limit the external validity of the findings. Specifically, Project STAR schools were slightly larger and had slightly lower initial achievement scores compared to statewide averages, raising questions about how representative the findings are for other educational contexts.
The project provided only limited teacher training, which did not specifically equip teachers to leverage smaller class sizes effectively. Additionally, training was not uniformly administered, and there was no demonstrated impact of the training itself. Thus, differences in instructional quality or consistency across classes might have influenced outcomes, independent of class size.
Although the study was longitudinal, it only maintained controlled class-size conditions through grade three, after which students returned to standard-sized classes. The analysis of longer-term effects beyond third grade thus faces challenges in isolating the direct impact of early exposure to small classes from subsequent educational experiences.
Aside from controlling for class size and the presence of aides, the study deliberately maintained “normal” school operations. This approach meant that other important classroom variables, such as teaching methods, curriculum variations, and peer dynamics, remained uncontrolled, potentially confounding the observed effects.
In the initial analysis of Project STAR data, we were primarily interested in answering the following two questions:
Primary question: Are there any differences in math scaled scores in 1st grade across class types?
Secondary question: If there are differences, which class type is associated with the highest math scaled scores in 1st grade?
To answer these questions, we adopted the following two-way ANOVA model with the following structure:
\[Y_{ijk} = \mu_{..} + \alpha_{i} + \beta_{j} + \epsilon_{ijk}\] where the index \(i\) represents the class type: small (\(i=1\)), regular (\(i=2\)), regular with aide (\(i=3\)); the index \(j\) represents the school; and \(k\) indexes students within each class-type-by-school cell. The rest of the parameters are as follows:
The assumptions of the two-way ANOVA model are as follows:
We answered the primary question of interest by conducting an F-test to determine if there are significant differences in math scaled scores across class types. The null and alternative hypotheses were as follows:
Assumptions for the F-test include the normality of residuals and homoscedasticity, which remain the same as the two-way ANOVA model.
The F-test results indicated that the p-value for class type (star1) is less than 0.05, suggesting that there are significant differences in math scaled scores across class types. We rejected the null hypothesis and concluded that at least one class type has a significantly different mean math score compared to the others.
We implemented the Tukey HSD test to find that students in small classes have significantly higher math scores compared to students in regular classes and regular classes with an aide. However, there was no significant difference between regular classes and regular classes with an aide.
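The model fit, F-test, and Tukey HSD described above can be sketched in base R as follows. This is a minimal sketch, assuming the grade-1 columns `g1tmathss` (total math scaled score), `g1classtype`, and `g1schid` (school ID) are present in `STAR_Students` as in the codebook:

```r
# Two-way ANOVA of grade-1 math scaled scores on class type and school,
# followed by Tukey's HSD for pairwise class-type comparisons.
# Column names (g1tmathss, g1classtype, g1schid) are assumed from the
# STAR student file.
fit <- aov(g1tmathss ~ g1classtype + factor(g1schid), data = STAR_Students)
anova(fit)                              # F-test for class-type differences
TukeyHSD(fit, which = "g1classtype")    # pairwise class-type comparisons
```

Restricting `TukeyHSD()` to the `g1classtype` term avoids computing the very large set of pairwise school contrasts.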
While the initial analysis report provided some valuable insights into the short-term effects of class size on student math scores in 1st grade, several caveats and limitations should be considered:
Limited Focus on Math Scaled Scores: The analysis focused on math scaled scores as the outcome variable, neglecting other subjects and measures of student achievement. This narrow focus might not capture the full spectrum of educational outcomes influenced by class size. Utilizing scores from other subjects or broader achievement metrics could provide a more comprehensive understanding of the impact of class size on student learning, and would allow a more comprehensive comparison of the long-term effects of class size on student achievement.
Short-term Analysis: The initial analysis only considered the math scores of 1st-grade students, providing a snapshot of the immediate effects of class size on academic performance. While this short-term perspective is valuable, it fails to capture the long-term implications of early educational experiences. A more comprehensive analysis that tracks student outcomes over multiple grades and years would offer a more nuanced understanding of how class size influences academic trajectories. This would require a longitudinal approach that follows students beyond the early grades and into high school and beyond.
Concern about Operational Adjustments Post-1st Grade: The initial analysis did not account for the operational adjustments made after the first year of the study, such as the redistribution of students between regular and regular-aide classes. While most students who were designated to small classes continued in that setting, students in regular and regular-aide classes were randomly reassigned. A 1999 update reported that class size and pupil-teacher ratio (PTR) are not the same, and that PTR does not influence student outcomes. While there may be some concern about keeping kindergarten student data despite the switching, based on the researchers' claims and the results from the initial analysis, it is safe to assume that the switching between regular and regular-aide classes did not have a significant impact on the results. Therefore, for this final analysis, we continue to include K-3 data, rather than only grades 1-3, which had minimal switching.
data_alluvium <- subset(STAR_Students, select = c(gkclasstype, g1classtype, g2classtype, g3classtype))
class_levels <- c("small", "regular", "regular-aide")
data_alluvium <- data_alluvium %>%
mutate(across(everything(), ~ factor(as.numeric(.),
levels = c(1, 2, 3),
labels = class_levels))) %>%
mutate(across(everything(), ~ fct_explicit_na(., na_level = "Unknown (NA)")))
grade_mapping <- c("gkclasstype" = "Kindergarten",
"g1classtype" = "Grade 1",
"g2classtype" = "Grade 2",
"g3classtype" = "Grade 3")
pairs <- list(c("gkclasstype", "g1classtype"),
c("g1classtype", "g2classtype"),
c("g2classtype", "g3classtype"))
flow_list <- lapply(pairs, function(x) {
data_alluvium %>%
group_by(across(all_of(x))) %>%
summarise(value = n(), .groups = "drop") %>%
mutate(
source = paste(grade_mapping[x[1]], ": ", .[[x[1]]], sep = ""),
target = paste(grade_mapping[x[2]], ": ", .[[x[2]]], sep = "")
) %>%
select(source, target, value)
})
transitions <- bind_rows(flow_list)
nodes <- unique(c(transitions$source, transitions$target))
nodes_df <- data.frame(name = nodes, stringsAsFactors = FALSE)
nodes_df$index <- seq_len(nrow(nodes_df)) - 1
transitions <- transitions %>%
left_join(nodes_df, by = c("source" = "name")) %>%
rename(source_index = index) %>%
left_join(nodes_df, by = c("target" = "name")) %>%
rename(target_index = index)
get_class <- function(x) sub(".*: ", "", x)
nodes_df$class_type <- sapply(nodes_df$name, get_class)
class_colors <- c(
"small" = "#F5C310", # yellow
"regular" = "#0072B2", # blue
"regular-aide" = "#009E73", # green
"Unknown (NA)" = "#D55E00" # reddish/orange
)
nodes_df$color <- class_colors[nodes_df$class_type]
transitions$link_color <- scales::alpha(nodes_df$color[ match(transitions$source, nodes_df$name) ], 0.4)
p <- plot_ly(
type = "sankey",
orientation = "h",
arrangement = "snap",
node = list(
label = nodes_df$name,
pad = 15,
thickness = 20,
line = list(color = "black", width = 0.5),
color = nodes_df$color
),
link = list(
source = transitions$source_index,
target = transitions$target_index,
value = transitions$value,
color = transitions$link_color,
opacity = 0.1
)
) %>%
layout(
title = "Alluvial Diagram of Students' Transfer",
font = list(size = 10),
width = 800
)
p
data_alluvium <- STAR_Students %>%
select(gkclasstype, g1classtype, g2classtype, g3classtype) %>%
drop_na() %>%
mutate(across(everything(), ~ factor(as.numeric(.),
levels = c(1, 2, 3),
labels = c("small", "regular", "regular-aide"))))
grade_mapping <- c("gkclasstype" = "Kindergarten",
"g1classtype" = "Grade 1",
"g2classtype" = "Grade 2",
"g3classtype" = "Grade 3")
pairs <- list(c("gkclasstype", "g1classtype"),
c("g1classtype", "g2classtype"),
c("g2classtype", "g3classtype"))
flow_list <- lapply(pairs, function(x) {
data_alluvium %>%
group_by(across(all_of(x))) %>%
summarise(value = n(), .groups = "drop") %>%
mutate(
source = paste(grade_mapping[x[1]], ": ", .[[x[1]]], sep = ""),
target = paste(grade_mapping[x[2]], ": ", .[[x[2]]], sep = "")
) %>%
select(source, target, value)
})
transitions <- bind_rows(flow_list)
nodes <- unique(c(transitions$source, transitions$target))
nodes_df <- data.frame(name = nodes, stringsAsFactors = FALSE)
nodes_df$index <- seq_len(nrow(nodes_df)) - 1
transitions <- transitions %>%
left_join(nodes_df, by = c("source" = "name")) %>%
rename(source_index = index) %>%
left_join(nodes_df, by = c("target" = "name")) %>%
rename(target_index = index)
get_class <- function(x) sub(".*: ", "", x)
nodes_df$class_type <- sapply(nodes_df$name, get_class)
class_colors <- c(
"small" = "#F5C310", # yellow
"regular" = "#0072B2", # blue
"regular-aide" = "#009E73" # green
)
nodes_df$color <- class_colors[nodes_df$class_type]
transitions$link_color <- scales::alpha(nodes_df$color[ match(transitions$source, nodes_df$name) ], 0.4)
p <- plot_ly(
type = "sankey",
orientation = "h",
arrangement = "snap",
node = list(
label = nodes_df$name,
pad = 15,
thickness = 20,
line = list(color = "black", width = 0.5),
color = nodes_df$color
),
link = list(
source = transitions$source_index,
target = transitions$target_index,
value = transitions$value,
color = transitions$link_color
)
) %>%
layout(
title = "Alluvial Diagram of Students' Transfer (NA Removed)",
font = list(size = 10),
width = 800
)
p
In the figures above, we see a substantial shift between regular and regular-aide classes from kindergarten to grade 1. However, we do not consider this a significant issue. Investigating edge cases in which some students switched between small and regular classes would be interesting, but it is beyond the scope of this analysis.
With these caveats in mind, we aim to extend the analysis of Project STAR data to explore the long-term effects of class size on student academic achievement. The natural new question of interest is:
For students who complete both primary and secondary education with the objective of pursuing higher education (college), does the exposure to small class sizes in early education (K-3) have a significant impact on their high school academic performance and college readiness?
One more point to keep in mind is that Tennessee implemented a new student assessment system the year STAR students entered grade 4, the Tennessee Comprehensive Assessment Program (TCAP). The TCAP assessment battery included norm-referenced tests from the Comprehensive Tests of Basic Skills (CTBS/McGraw Hill, 1989) and BSF criterion-referenced tests for each grade in reading and mathematics. Scores on these tests were made available by the Tennessee State Department of Education, as students progressed from grade 4 (1989-1990) through grade 8 (1993-1994).
The user guide notes that “Scores on the CTBS are not directly comparable to those on the SATs. However, IRT scale scores were available for each CTBS subtest so that comparisons can be made meaningfully across grades 4—8.” Hence the scaled scores are valid for comparison across grades 4-8.
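Because the IRT scale scores are comparable across grades 4-8, score trajectories can be assembled in long format for cross-grade comparison. A minimal sketch, assuming the CTBS math score columns `g4tmathss` through `g8tmathss` and the ID column `stdntid` in `STAR_Students`:

```r
# Reshape grade 4-8 CTBS math scale scores to long format so that
# trajectories can be compared across grades (column names assumed
# from the STAR student file). dplyr::select() is used explicitly in
# case MASS is attached and masks select().
math_long <- STAR_Students %>%
  dplyr::select(stdntid, g4tmathss, g5tmathss, g6tmathss, g7tmathss, g8tmathss) %>%
  tidyr::pivot_longer(-stdntid, names_to = "grade", values_to = "math_ss")
```

From here, per-grade means or per-student trajectories can be summarized with a `group_by()`/`summarise()` pass.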
The new question of interest requires a subset of students who completed both primary and secondary education and for whom we have complete data on academic performance and college readiness. To achieve this, we filter the dataset using the binary flag variables in the data file, which indicate participation or non-participation at each stage of data collection.
| Flag Type | Variable Name | Description |
|---|---|---|
| Achievement-data Flags | `flaggk` | Achievement data available, kindergarten |
| | `flagg1` | Achievement data available, grade 1 |
| | `flagg2` | Achievement data available, grade 2 |
| | `flagg3` | Achievement data available, grade 3 |
| | `flagg4` | Achievement data available, grade 4 |
| | `flagg5` | Achievement data available, grade 5 |
| | `flagg6` | Achievement data available, grade 6 |
| | `flagg7` | Achievement data available, grade 7 |
| | `flagg8` | Achievement data available, grade 8 |
| High School Data Flags | `flagsatact` | Valid SAT/ACT score available |
| | `flaghscourse` | At least two years of high school course data |
| | `flaghsgraduate` | Data on high school graduation status available |
| Additional STAR Indicators | `cmpstype` | Class type composite (10988 valid, 613 missing) |
| | `cmpsdura` | Duration composite (10988 valid, 613 missing) |
Our subset will require Achievement-data Flags and High School Data Flags to be “YES” for each student. This ensures that we have complete data on student achievement from kindergarten through grade 8 and high school graduation status.
We also require that the composite class type and duration variables (cmpstype and cmpsdura) are not missing, as these are essential for our analysis. Missing values in these two variables are another caveat that could be addressed, but it is beyond the scope of this particular analysis.
We should note that the question of interest is specific to students who completed both primary and secondary education. Therefore, any concerns about selection bias are beyond the scope of this analysis.
STAR_flag_vars <- c('FLAGSGK',
'FLAGSG1',
'FLAGSG2',
'FLAGSG3')
Achievement_flag_vars <- c('flaggk',
'flagg1',
'flagg2',
'flagg3',
'flagg4',
'flagg5',
'flagg6',
'flagg7',
'flagg8')
HS_College_flag_var <- c('flagsatact',
'flaghscourse',
'flaghsgraduate')
# Subset students who have Achievement_flag_vars and HS_College_flag_var columns == 'YES'
All_Grades_Students <- STAR_Students %>%
filter(if_all(all_of(Achievement_flag_vars), ~ . == "YES") &
if_all(all_of(HS_College_flag_var), ~ . == "YES") &
# if `cmpstype` is not missing
!is.na(cmpstype) & !is.na(cmpsdura)
)
Achievement_data_students <- STAR_Students %>%
filter(if_all(all_of(Achievement_flag_vars), ~ . == "YES"))
HS_College_data_students <- STAR_Students %>%
filter(if_all(all_of(HS_College_flag_var), ~ . == "YES"))
student_count <- nrow(STAR_Students)
complete_student_count1 <- nrow(Achievement_data_students)
incomplete_student_count1 <- student_count - complete_student_count1
counts1 <- c(complete_student_count1, incomplete_student_count1)
percent1 <- round(100 * counts1 / student_count, 1)
labels1 <- paste(c("Data Available", "Data \n Unavailable"),
"\nCount:", counts1,
"\n", percent1, "%")
complete_student_count2 <- nrow(HS_College_data_students)
incomplete_student_count2 <- student_count - complete_student_count2
counts2 <- c(complete_student_count2, incomplete_student_count2)
percent2 <- round(100 * counts2 / student_count, 1)
labels2 <- paste(c("Data Available", "Data \n Unavailable"),
"\nCount:", counts2,
"\n", percent2, "%")
complete_student_count3 <- nrow(All_Grades_Students)
incomplete_student_count3 <- student_count - complete_student_count3
counts3 <- c(complete_student_count3, incomplete_student_count3)
percent3 <- round(100 * counts3 / student_count, 1)
labels3 <- paste(c("Data Available", "Data \n Unavailable"),
"\nCount:", counts3,
"\n", percent3, "%")
par(mfrow = c(2, 2), mar = c(1, 4, 1, 4))
pie(counts1, labels = labels1,
main = "Achievement Data Completeness", col = c("lightblue", "lightgray"))
pie(counts2, labels = labels2,
main = "High School Data Completeness", col = c("lightgreen", "lightgray"))
pie(counts3, labels = labels3,
main = "Achievement & High School Completeness", col = c("lightcoral", "lightgray"))
par(mfrow = c(1, 1))
We notice that the proportion of students with complete data for both achievement and high school/college readiness is relatively low compared to the total number of students in the dataset. This highlights the challenges associated with longitudinal studies and the importance of data completeness for robust analyses. Despite the limitations, we are still left with 506 students who have complete data for the longitudinal analysis.
Our analysis focuses on the long-term effects of class size on student academic achievement. To capture the cumulative impact of class-size exposure, we filter the dataset to include only students who were in a specific class type for at least three years. The three-year threshold accounts for the fact that kindergarten attendance was not mandatory by law or by study design. We utilize the cmpsdura variable, which records the duration of exposure to a specific class type.
This decision addresses several concerns:
cmpsdura_table_original <- All_Grades_Students %>%
filter(!is.na(cmpsdura)) %>%
group_by(cmpsdura) %>%
summarise(count = n(), .groups = "drop") %>%
mutate(percentage = sprintf("%.2f%%", count / sum(count) * 100))
head(cmpsdura_table_original) %>%
kable(caption = "Distribution of Duration Composite (`cmpsdura`) in K-12 Data Available Students") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| cmpsdura | count | percentage |
|---|---|---|
| 1 | 76 | 15.02% |
| 2 | 13 | 2.57% |
| 3 | 98 | 19.37% |
| 4 | 319 | 63.04% |
All_Grades_Students <- All_Grades_Students %>%
filter(cmpsdura > 2)
We see that a majority (407 students) out of the 506 students with complete data have been in a specific class type for at least three years.
Any results or conclusions beyond this point are based on these 407 students who have complete data for the analysis.
As with the initial analysis, any conclusions made from this report can only be generalized to the population of students who have complete data and have been in a specific class type for at least three years and have completed their education in Tennessee.
We keep the following variables for the longitudinal analysis:
| Type | Variable Name | Description |
|---|---|---|
| Demographic Variables | `stdntid` | Student ID |
| Class Type & STAR Participation | `gkclasstype` | Class type in kindergarten |
| | `g1classtype` | Class type in grade 1 |
| | `g2classtype` | Class type in grade 2 |
| | `g3classtype` | Class type in grade 3 |
| | `cmpstype` | Class type composite |
| | `cmpsdura` | Duration composite |
| | `yearsstar` | Number of years in STAR |
| Reading & Math Scores | `gktreadss` | Total reading scaled score, SAT, kindergarten |
| | `gktmathss` | Total math scaled score, SAT, kindergarten |
| | `g1treadss` | Total reading scaled score, SAT, grade 1 |
| | `g1tmathss` | Total math scaled score, SAT, grade 1 |
| | `g2treadss` | Total reading scaled score, SAT, grade 2 |
| | `g2tmathss` | Total math scaled score, SAT, grade 2 |
| | `g3treadss` | Total reading scaled score, SAT, grade 3 |
| | `g3tmathss` | Total math scaled score, SAT, grade 3 |
| | `g4treadss` | Total reading scaled score, CTBS, grade 4 |
| | `g4tmathss` | Total math scaled score, CTBS, grade 4 |
| | `g5treadss` | Total reading scaled score, CTBS, grade 5 |
| | `g5tmathss` | Total math scaled score, CTBS, grade 5 |
| | `g6treadss` | Total reading scaled score, CTBS, grade 6 |
| | `g6tmathss` | Total math scaled score, CTBS, grade 6 |
| | `g7treadss` | Total reading scaled score, CTBS, grade 7 |
| | `g7tmathss` | Total math scaled score, CTBS, grade 7 |
| | `g8treadss` | Total reading scaled score, CTBS, grade 8 |
| | `g8tmathss` | Total math scaled score, CTBS, grade 8 |
| High School Performance | `hsgpaoverall` | Overall high school GPA |
| | `hsactcomp` | ACT composite score, high school |
| | `hsactconverted` | SAT score converted to the ACT composite metric, high school |
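The variable list above can be pulled into a single analysis data frame with one `select()` call. A minimal sketch, with column names taken from the table and `dplyr::select()` spelled out in case MASS is attached and masks `select()`:

```r
# Keep only the longitudinal-analysis variables listed in the table,
# applied to the subsetted students (All_Grades_Students).
analysis_vars <- All_Grades_Students %>%
  dplyr::select(
    stdntid,
    gkclasstype, g1classtype, g2classtype, g3classtype,
    cmpstype, cmpsdura, yearsstar,
    gktreadss, gktmathss,
    g1treadss, g1tmathss, g2treadss, g2tmathss,
    g3treadss, g3tmathss, g4treadss, g4tmathss,
    g5treadss, g5tmathss, g6treadss, g6tmathss,
    g7treadss, g7tmathss, g8treadss, g8tmathss,
    hsgpaoverall, hsactcomp, hsactconverted
  )
```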
SAT_students <- STAR_Students %>%
filter(hssat == "YES" & hsact == "NO")
ACT_students <- STAR_Students %>%
filter(hsact == "YES" & hssat == "NO")
SAT_ACT_students <- STAR_Students %>%
filter(hssat == "YES" & hsact == "YES")
no_SAT_ACT_students <- STAR_Students %>%
filter(hssat == "NO" & hsact == "NO")
num_SAT_students <- nrow(SAT_students)
num_ACT_students <- nrow(ACT_students)
num_SAT_ACT_students <- nrow(SAT_ACT_students)
num_no_SAT_ACT_students <- nrow(no_SAT_ACT_students)
df <- data.frame(
Category = c("SAT Only", "ACT Only", "Both SAT & ACT", "Neither SAT nor ACT"),
Count = c(num_SAT_students, num_ACT_students, num_SAT_ACT_students, num_no_SAT_ACT_students)
)
df$Percentage <- round(df$Count / sum(df$Count) * 100, 1)
pie_chart4 <- plot_ly(
data = df,
labels = ~Category,
values = ~Percentage,
type = 'pie',
textinfo = 'percent',
hoverinfo = 'label+percent+text',
text = ~paste0("Count: ", Count),
marker = list(colors = c("lightblue", "lightgreen", "lightcoral", "lightgray")),
title = "SAT and ACT Participation (Original Data)",
width = 500, height = 400
)
SAT_students <- All_Grades_Students %>%
filter(hssat == "YES" & hsact == "NO")
ACT_students <- All_Grades_Students %>%
filter(hsact == "YES" & hssat == "NO")
SAT_ACT_students <- All_Grades_Students %>%
filter(hssat == "YES" & hsact == "YES")
no_SAT_ACT_students <- All_Grades_Students %>%
filter(hssat == "NO" & hsact == "NO")
num_SAT_students <- nrow(SAT_students)
num_ACT_students <- nrow(ACT_students)
num_SAT_ACT_students <- nrow(SAT_ACT_students)
num_no_SAT_ACT_students <- nrow(no_SAT_ACT_students)
df <- data.frame(
Category = c("SAT Only", "ACT Only", "Both SAT & ACT", "Neither SAT nor ACT"),
Count = c(num_SAT_students, num_ACT_students, num_SAT_ACT_students, num_no_SAT_ACT_students)
)
df$Percentage <- round(df$Count / sum(df$Count) * 100, 1)
pie_chart5 <- plot_ly(
data = df,
labels = ~Category,
values = ~Percentage,
type = 'pie',
textinfo = 'percent',
hoverinfo = 'label+percent+text',
text = ~paste0("Count: ", Count),
marker = list(colors = c("lightblue", "lightgreen", "lightcoral", "lightgray")),
title = "SAT and ACT Participation (subsetted students)",
width = 500, height = 400
)
# Subset data for STAR_Students
SAT_students <- STAR_Students[STAR_Students$hssat == "YES" & STAR_Students$hsact == "NO", ]
ACT_students <- STAR_Students[STAR_Students$hsact == "YES" & STAR_Students$hssat == "NO", ]
SAT_ACT_students <- STAR_Students[STAR_Students$hssat == "YES" & STAR_Students$hsact == "YES", ]
no_SAT_ACT_students <- STAR_Students[STAR_Students$hssat == "NO" & STAR_Students$hsact == "NO", ]
num_SAT_students <- nrow(SAT_students)
num_ACT_students <- nrow(ACT_students)
num_SAT_ACT_students <- nrow(SAT_ACT_students)
num_no_SAT_ACT_students <- nrow(no_SAT_ACT_students)
# Data for first pie chart
counts1 <- c(num_ACT_students, num_SAT_students, num_SAT_ACT_students, num_no_SAT_ACT_students)
labels1 <- c("ACT Only", "SAT Only", "Both SAT & ACT", "Neither SAT nor ACT")
percentages1 <- round(counts1 / sum(counts1) * 100, 1)
labels1 <- paste(labels1, "\n", percentages1, "%")
# Subset data for All_Grades_Students
SAT_students <- All_Grades_Students[All_Grades_Students$hssat == "YES" & All_Grades_Students$hsact == "NO", ]
ACT_students <- All_Grades_Students[All_Grades_Students$hsact == "YES" & All_Grades_Students$hssat == "NO", ]
SAT_ACT_students <- All_Grades_Students[All_Grades_Students$hssat == "YES" & All_Grades_Students$hsact == "YES", ]
num_SAT_students <- nrow(SAT_students)
num_ACT_students <- nrow(ACT_students)
num_SAT_ACT_students <- nrow(SAT_ACT_students)
# Data for second pie chart
counts2 <- c(num_ACT_students, num_SAT_students, num_SAT_ACT_students)
labels2 <- c("ACT Only", "SAT Only", "Both SAT & ACT")
percentages2 <- round(counts2 / sum(counts2) * 100, 1)
labels2 <- paste(labels2, "\n", percentages2, "%")
# Set graphical parameters for side-by-side plots
par(mfrow = c(1, 2), mar = c(1, 4, 10, 4))
# Pie chart for STAR_Students
pie(counts1, labels = labels1, col = c("lightgreen", "lightblue", "lightcoral", "lightgray"),
main = "SAT & ACT Participation \n (Original Data)")
# Pie chart for All_Grades_Students
pie(counts2, labels = labels2, col = c("lightgreen", "lightblue", "lightcoral"),
main = "SAT & ACT Participation \n (Subsetted Students - \n New Population)")
# Reset graphical parameters
par(mfrow = c(1, 1))
To make use of standardized college entrance exam scores, we need the ACT and SAT scores to be comparable (i.e., on the same scale). ACT composite scores range from 1 to 36, while SAT scores ranged from 400 to 1600 at the time of the study. Although the SAT scale changed over the years, it has since reverted to the same 400-1600 range.
Converting scores to match either exam's scale inevitably loses some information, but it is necessary for meaningful comparison. To minimize this loss, we keep the widely taken ACT composite score, hsactcomp (taken by 90% of the new sample of interest), as is, and use the SAT composite converted to the ACT scale (hsactconverted) for the analysis.
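The dataset already provides hsactconverted, so no conversion is performed in this report; for intuition only, a concordance-style conversion typically works as a band lookup: round the SAT composite into a score band and map it to the corresponding ACT score. The cut points and ACT values below are illustrative placeholders, not the official College Board/ACT concordance table:

```r
# Illustrative sketch of a concordance-style SAT -> ACT conversion.
# The bands below are placeholders, NOT the official concordance; the
# analysis itself uses the dataset's own hsactconverted variable.
sat_to_act <- function(sat) {
  breaks <- c(400, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600)
  acts   <- c( 11,  15,  17,   19,   21,   24,   27,   31,   34)
  # findInterval() returns the band index for each SAT score
  acts[findInterval(sat, breaks, rightmost.closed = TRUE)]
}

sat_to_act(c(950, 1250, 1510))
# -> 17 24 34
```

A real conversion would substitute the published concordance values for these placeholder bands.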
variables_of_interest <- c(
'stdntid',
'gkclasstype',
'g1classtype',
'g2classtype',
'g3classtype',
'cmpstype',
'cmpsdura',
'yearsstar',
'gktreadss',
'gktmathss',
'g1treadss',
'g1tmathss',
'g2treadss',
'g2tmathss',
'g3treadss',
'g3tmathss',
'g4treadss',
'g4tmathss',
'g5treadss',
'g5tmathss',
'g6treadss',
'g6tmathss',
'g7treadss',
'g7tmathss',
'g8treadss',
'g8tmathss',
'hsgpaoverall',
'hsactcomp',
'hsactconverted'
)
# Keep variables_of_interest from All_Grades_Students
subsetted_data <- All_Grades_Students %>%
select(all_of(variables_of_interest))
# Table of percentages and counts for class type (`cmpstype`)
class_type_table_original <- STAR_Students %>%
filter(!is.na(cmpstype)) %>%
group_by(cmpstype) %>%
summarise(count = n(), .groups = "drop") %>%
mutate(percentage = sprintf("%.2f%%", count / sum(count) * 100))
head(class_type_table_original) %>%
kable(caption = "Distribution of Class Types (`cmpstype`) in Original Data") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| cmpstype | count | percentage |
|---|---|---|
| SMALL | 3202 | 29.14% |
| REGULAR | 3045 | 27.71% |
| AIDE | 4741 | 43.15% |
class_type_table <- subsetted_data %>%
group_by(cmpstype) %>%
summarise(count = n(), .groups = "drop") %>%
mutate(percentage = sprintf("%.2f%%", count / sum(count) * 100))
# Print using kable
head(class_type_table) %>%
kable(caption = "Distribution of Class Types (`cmpstype`) in Subsetted Data") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| cmpstype | count | percentage |
|---|---|---|
| SMALL | 167 | 40.05% |
| REGULAR | 66 | 15.83% |
| AIDE | 184 | 44.12% |
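The two tables above can be compared directly: a chi-square goodness-of-fit test of the subset's class-type counts against the original sample's proportions quantifies whether students with complete records are a representative draw with respect to class type. A sketch using the counts shown above:

```r
# Class-type counts taken from the two tables above
original_counts <- c(SMALL = 3202, REGULAR = 3045, AIDE = 4741)
subset_counts   <- c(SMALL = 167,  REGULAR = 66,   AIDE = 184)

# Goodness of fit: does the subset follow the original class-type proportions?
gof <- chisq.test(subset_counts, p = original_counts / sum(original_counts))
gof
```

Given the over-representation of small classes visible above (40.05% vs. 29.14%), this test is expected to reject representativeness, which is worth keeping in mind when generalizing results from the subset to the original sample.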
In this section, we will verify the initial analysis results with the
new sample/population of interest. The initial analysis used the scaled
math scores of 1st-grade students to investigate the impact of class
size on academic performance. The two-way ANOVA model showed that the
school (schoolid) did not have a significant effect on math
scores, while the class type (cmpstype) did have a
significant effect. We will replicate this analysis with the new subset
of students.
# Histogram of math scaled scores for grade 1 students (original data)
math_hist_original <- ggplot(STAR_Students, aes(x = g1tmathss)) +
geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
labs(title = "Distribution of Math Scaled Scores (Grade 1)",
x = "Math Scaled Score",
y = "Frequency") +
theme_minimal()
# Histogram of math scaled scores for grade 1 students (subsetted data)
math_hist_subset <- ggplot(subsetted_data, aes(x = g1tmathss)) +
geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
labs(title = "Distribution of Math Scaled Scores (Grade 1)",
x = "Math Scaled Score",
y = "Frequency") +
theme_minimal()
# Side by side boxplots of math scaled scores by class type (`g1classtype`) for original and subsetted data
math_boxplot <- ggplot(STAR_Students, aes(x = g1classtype, y = g1tmathss)) +
geom_boxplot(fill = "skyblue", color = "black") +
geom_jitter(width = 0.2, color = "navy", alpha = 0.05) +
labs(title = "Math Scores by Class Type (Original)",
x = "Class Type",
y = "Math Scaled Score") +
theme_minimal()
math_boxplot_subset <- ggplot(subsetted_data, aes(x = g1classtype, y = g1tmathss)) +
geom_boxplot(fill = "skyblue", color = "black") +
geom_jitter(width = 0.2, color = "navy", alpha = 0.5) +
labs(title = "Math Scores by Class Type (New)",
x = "Class Type",
y = "Math Scaled Score") +
theme_minimal()
math_hist_original + math_hist_subset
math_boxplot + math_boxplot_subset
We will conduct a one-way ANOVA to test the null hypothesis that the mean math scaled scores are the same across different class types in grade 1 in this new population. The alternative hypothesis is that at least one class type has a different mean math scaled score.
The model is as follows:
\[\begin{align*} Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \end{align*}\]
where:

- \(Y_{ij}\) is the grade 1 math scaled score of the \(j\)-th student in class type \(i\),
- \(\mu\) is the overall mean score,
- \(\alpha_i\) is the effect of class type \(i\) (small, regular, or regular + aide), and
- \(\epsilon_{ij}\) is the random error term.

The independence, normality, and homoscedasticity assumptions on \(\epsilon_{ij}\) are required for the validity of the ANOVA results.
# Main effect plot of class type
class_type_effect_plot_info <- subsetted_data %>%
group_by(g1classtype) %>%
summarise(mean_math_score = mean(g1tmathss, na.rm = TRUE),
SE = sd(g1tmathss, na.rm = TRUE) / sqrt(n()),
Lower_CI = mean_math_score - qt(0.975, n() - 1) * SE,
Upper_CI = mean_math_score + qt(0.975, n() - 1) * SE)
class_type_effect_plot <- ggplot(class_type_effect_plot_info, aes(x = g1classtype, y = mean_math_score)) +
geom_point(size = 3, color = "skyblue") +
geom_line(aes(group = 1), color = "skyblue") +
geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.2) +
labs(title = "Main Effect of Class Type on Math Scores (Grade 1)",
x = "Class Type",
y = "Mean Math Scaled Score") +
theme_minimal()
class_type_effect_plot
# One-way ANOVA model
aov1 <- aov(g1tmathss ~ g1classtype, data = subsetted_data)
# Summary of ANOVA
aov1_summary <- broom::tidy(aov1)
kable(aov1_summary, caption = "ANOVA Summary for Math Scaled Scores by Class Type (Grade 1) in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| g1classtype | 2 | 24343.58 | 12171.791 | 8.963494 | 0.0001546 |
| Residuals | 413 | 560824.79 | 1357.929 | NA | NA |
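The stated assumptions can be checked on the fitted model's residuals; a brief diagnostic sketch for the grade 1 math model fitted above (Levene's test assumes the car package is available):

```r
# Diagnostic checks for the one-way ANOVA assumptions (sketch)
res <- residuals(aov1)

# Normality: QQ plot of residuals plus the Shapiro-Wilk test
qqnorm(res)
qqline(res)
shapiro.test(res)

# Homoscedasticity: Levene's test of equal variances across class types
car::leveneTest(g1tmathss ~ g1classtype, data = subsetted_data)
```

Independence follows from the study design (one score per student); notable departures in the QQ plot or a significant Levene's test would warrant a robust alternative such as Welch's ANOVA (`oneway.test`).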
Consistent with the initial analysis, the one-way ANOVA results show that class type has a significant effect on grade 1 math scaled scores in the new population of interest.
In the initial analysis, the Tukey’s HSD post-hoc test identified that students in small classes have significantly higher math scores compared to students in either regular or regular-aide classes. We will conduct the same post-hoc test to identify which class types have significantly different math scores in this new population.
aov1_tukey_results <- TukeyHSD(aov1)
kable(tidy(aov1_tukey_results),
caption = "Tukey's HSD Post-hoc Test for Math Scaled Scores by Class Type (Grade 1) in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | contrast | null.value | estimate | conf.low | conf.high | adj.p.value |
|---|---|---|---|---|---|---|
| g1classtype | REGULAR CLASS-SMALL CLASS | 0 | -19.553846 | -32.215146 | -6.892546 | 0.0009198 |
| g1classtype | REGULAR + AIDE CLASS-SMALL CLASS | 0 | -13.371585 | -22.633163 | -4.110006 | 0.0021610 |
| g1classtype | REGULAR + AIDE CLASS-REGULAR CLASS | 0 | 6.182262 | -6.333442 | 18.697965 | 0.4767002 |
aov1_tukey_df <- as.data.frame(aov1_tukey_results$g1classtype)
aov1_tukey_df$comparison <- rownames(aov1_tukey_df)
colnames(aov1_tukey_df) <- c("Difference", "Lower_CI", "Upper_CI", "p_value", "Comparison")
ggplot(aov1_tukey_df, aes(x = Comparison, y = Difference)) +
geom_point(aes(color = p_value < 0.05), size = 3) +
geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Tukey's HSD Post-hoc Test for Math Scaled Scores by Class Type (Grade 1)",
x = "Class Type Comparison",
y = "Difference in Mean Math Scaled Scores") +
coord_flip() +
theme_minimal()
We observe that the Tukey’s HSD post-hoc test results are consistent with the initial analysis. Students in small classes have significantly higher math scores compared to students in regular and regular-aide classes in this new population of interest.
Although not covered in the initial report, we also verify the short-term effect of class size on reading scaled scores in grade 1 for the new population of interest.
# One-way ANOVA model for reading scaled scores by class type (grade 1)
aov1_reading <- aov(g1treadss ~ g1classtype, data = subsetted_data)
# Summary of ANOVA
aov1_reading_summary <- broom::tidy(aov1_reading)
kable(aov1_reading_summary, caption = "ANOVA Summary for Reading Scaled Scores by Class Type (Grade 1) in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| g1classtype | 2 | 40819.13 | 20409.563 | 8.565375 | 0.0002274 |
| Residuals | 404 | 962650.57 | 2382.798 | NA | NA |
# Tukey's HSD post-hoc test for reading scaled scores by class type (grade 1)
aov1_reading_tukey_results <- TukeyHSD(aov1_reading)
kable(tidy(aov1_reading_tukey_results),
caption = "Tukey's HSD Post-hoc Test for Reading Scaled Scores by Class Type (Grade 1) in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | contrast | null.value | estimate | conf.low | conf.high | adj.p.value |
|---|---|---|---|---|---|---|
| g1classtype | REGULAR CLASS-SMALL CLASS | 0 | -28.95451 | -45.975017 | -11.9340030 | 0.0002204 |
| g1classtype | REGULAR + AIDE CLASS-SMALL CLASS | 0 | -13.18388 | -25.579625 | -0.7881252 | 0.0340065 |
| g1classtype | REGULAR + AIDE CLASS-REGULAR CLASS | 0 | 15.77063 | -1.038617 | 32.5798866 | 0.0711872 |
aov1_reading_tukey_df <- as.data.frame(aov1_reading_tukey_results$g1classtype)
aov1_reading_tukey_df$comparison <- rownames(aov1_reading_tukey_df)
colnames(aov1_reading_tukey_df) <- c("Difference", "Lower_CI", "Upper_CI", "p_value", "Comparison")
ggplot(aov1_reading_tukey_df, aes(x = Comparison, y = Difference)) +
geom_point(aes(color = p_value < 0.05), size = 3) +
geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Tukey's HSD Post-hoc Test for Reading Scaled Scores by Class Type (Grade 1)",
x = "Class Type Comparison",
y = "Difference in Mean Reading Scaled Scores") +
coord_flip() +
theme_minimal()
The results for reading scaled scores are consistent with those for math: students in small classes have significantly higher reading scores than students in regular and regular-aide classes in this new population of interest. For this immediate (very short-term) effect of class size on academic performance, the results agree with the initial analysis.
We will now investigate the impact of class size on student
achievement in grades beyond STAR (grades 4-8). We will
focus on reading and math scaled scores for grades 4-8
and analyze the effect of class type on these scores. Since we are now
exploring effects after the STAR program, we will use
cmpstype as the class type variable, which at this point
represents the class type that the student spent at least 3-4 years in
during K-3.
We will conduct a one-way ANOVA to test the null hypothesis that the mean math scaled scores are the same across different class types in grades 4-8. The alternative hypothesis is that at least one class type has a different mean math scaled score.
The model for each grade is as follows:
\[\begin{align*} Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \end{align*}\]
where:

- \(Y_{ij}\) is the math scaled score at the given grade of the \(j\)-th student in K-3 class type \(i\) (cmpstype),
- \(\mu\) is the overall mean score for that grade,
- \(\alpha_i\) is the effect of K-3 class type \(i\), and
- \(\epsilon_{ij}\) is the random error term.

The independence, normality, and homoscedasticity assumptions on \(\epsilon_{ij}\) are required for the validity of the ANOVA results.
# One-way ANOVA model for math scaled scores by class type (grades 4-8)
aov_g4_math <- aov(g4tmathss ~ cmpstype, data = subsetted_data)
aov_g5_math <- aov(g5tmathss ~ cmpstype, data = subsetted_data)
aov_g6_math <- aov(g6tmathss ~ cmpstype, data = subsetted_data)
aov_g7_math <- aov(g7tmathss ~ cmpstype, data = subsetted_data)
aov_g8_math <- aov(g8tmathss ~ cmpstype, data = subsetted_data)
# Kable for ANOVA results
aov_g4_math_summary <- broom::tidy(aov_g4_math)
aov_g5_math_summary <- broom::tidy(aov_g5_math)
aov_g6_math_summary <- broom::tidy(aov_g6_math)
aov_g7_math_summary <- broom::tidy(aov_g7_math)
aov_g8_math_summary <- broom::tidy(aov_g8_math)
kable(aov_g4_math_summary, caption = "ANOVA Summary for Math Scaled Scores by Class Type (Grade 4) in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| cmpstype | 2 | 5151.582 | 2575.791 | 1.887543 | 0.1527476 |
| Residuals | 413 | 563590.627 | 1364.626 | NA | NA |
kable(aov_g5_math_summary, caption = "ANOVA Summary for Math Scaled Scores by Class Type (Grade 5) in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| cmpstype | 2 | 2126.525 | 1063.263 | 0.8839805 | 0.4139153 |
| Residuals | 413 | 496761.437 | 1202.812 | NA | NA |
kable(aov_g6_math_summary, caption = "ANOVA Summary for Math Scaled Scores by Class Type (Grade 6) in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| cmpstype | 2 | 2833.384 | 1416.692 | 1.010973 | 0.3647677 |
| Residuals | 411 | 575940.375 | 1401.315 | NA | NA |
kable(aov_g7_math_summary, caption = "ANOVA Summary for Math Scaled Scores by Class Type (Grade 7) in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| cmpstype | 2 | 6899.988 | 3449.994 | 2.269325 | 0.1046665 |
| Residuals | 414 | 629393.139 | 1520.273 | NA | NA |
kable(aov_g8_math_summary, caption = "ANOVA Summary for Math Scaled Scores by Class Type (Grade 8) in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| cmpstype | 2 | 5313.938 | 2656.969 | 1.630356 | 0.1971178 |
| Residuals | 413 | 673060.445 | 1629.686 | NA | NA |
# Line plot of mean math scores by class type for grades 4-8
math_scores_by_grade <- subsetted_data %>%
group_by(cmpstype) %>%
summarise(mean_g4_math = mean(g4tmathss, na.rm = TRUE),
mean_g5_math = mean(g5tmathss, na.rm = TRUE),
mean_g6_math = mean(g6tmathss, na.rm = TRUE),
mean_g7_math = mean(g7tmathss, na.rm = TRUE),
mean_g8_math = mean(g8tmathss, na.rm = TRUE))
math_scores_by_grade_long <- math_scores_by_grade %>%
pivot_longer(cols = starts_with("mean"), names_to = "Grade", values_to = "Mean_Math_Score")
math_scores_plot1 <- ggplot(math_scores_by_grade_long, aes(x = cmpstype, y = Mean_Math_Score, color = Grade)) +
geom_point(size = 3) +
geom_line(aes(group = Grade)) +
labs(title = "Mean Math Scaled Scores by Class Type (Grades 4-8)",
x = "Class Type",
y = "Mean Math Scaled Score") +
theme_minimal()
math_scores_by_grade <- subsetted_data %>%
group_by(cmpstype) %>%
summarise(mean_g4_math = mean(g4tmathss, na.rm = TRUE),
mean_g5_math = mean(g5tmathss, na.rm = TRUE),
mean_g6_math = mean(g6tmathss, na.rm = TRUE),
mean_g7_math = mean(g7tmathss, na.rm = TRUE),
mean_g8_math = mean(g8tmathss, na.rm = TRUE)) %>%
pivot_longer(cols = starts_with("mean"),
names_to = "Grade",
values_to = "Mean_Math_Score") %>%
mutate(Grade = factor(Grade,
levels = c("mean_g4_math", "mean_g5_math", "mean_g6_math", "mean_g7_math", "mean_g8_math"),
labels = c("4", "5", "6", "7", "8")))
math_scores_plot2 <- ggplot(math_scores_by_grade, aes(x = Grade, y = Mean_Math_Score, group = cmpstype, color = cmpstype)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3) +
labs(x = "Grade",
y = "Mean Math Scaled Score",
color = "Class Type") +
theme_minimal()
math_scores_plot1 + math_scores_plot2
No significant effect of K-3 class type on math scaled scores was observed in any of grades 4-8 in this new population of interest. The roughly parallel, increasing trends in mean math scores across grades 4-8 suggest that a linear regression with grade as a predictor would summarize the data well: math scores rise with grade level, while K-3 class type contributes little.
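That regression can be sketched by reshaping the grade 4-8 math scores to long format and regressing score on a numeric grade trend plus K-3 class type; a significant grade coefficient alongside non-significant class-type terms would formalize the observation above (column names as defined earlier in this report):

```r
library(dplyr)
library(tidyr)

# Long format: one row per student-grade pair, with grade as numeric 4..8
math_long <- subsetted_data %>%
  select(stdntid, cmpstype, g4tmathss, g5tmathss, g6tmathss, g7tmathss, g8tmathss) %>%
  pivot_longer(-c(stdntid, cmpstype),
               names_to = "grade", values_to = "math_score") %>%
  mutate(grade = as.numeric(sub("g(\\d)tmathss", "\\1", grade)))

# Linear trend in grade plus K-3 class type
fit <- lm(math_score ~ grade + cmpstype, data = math_long)
summary(fit)
```

Note that repeated measures per student violate the independence assumption of ordinary least squares, so standard errors from this sketch are optimistic; a mixed model with a student-level random effect (e.g., via lme4) would be more appropriate for formal inference.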
We repeat the above analysis for reading scaled scores in grades 4-8 to investigate the impact of class type on reading achievement.
# One-way ANOVA model for read scaled scores by class type (grades 4-8)
aov_g4_read <- aov(g4treadss ~ cmpstype, data = subsetted_data)
aov_g5_read <- aov(g5treadss ~ cmpstype, data = subsetted_data)
aov_g6_read <- aov(g6treadss ~ cmpstype, data = subsetted_data)
aov_g7_read <- aov(g7treadss ~ cmpstype, data = subsetted_data)
aov_g8_read <- aov(g8treadss ~ cmpstype, data = subsetted_data)
# Kable for ANOVA results
aov_g4_read_summary <- broom::tidy(aov_g4_read)
aov_g5_read_summary <- broom::tidy(aov_g5_read)
aov_g6_read_summary <- broom::tidy(aov_g6_read)
aov_g7_read_summary <- broom::tidy(aov_g7_read)
aov_g8_read_summary <- broom::tidy(aov_g8_read)
kable(aov_g4_read_summary, caption = "ANOVA Summary for Read Scaled Scores by Class Type (Grade 4) in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| cmpstype | 2 | 14208.83 | 7104.417 | 6.330984 | 0.0019634 |
| Residuals | 401 | 449988.72 | 1122.166 | NA | NA |
kable(aov_g5_read_summary, caption = "ANOVA Summary for Read Scaled Scores by Class Type (Grade 5) in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| cmpstype | 2 | 4932.341 | 2466.170 | 1.85976 | 0.1570085 |
| Residuals | 414 | 548992.710 | 1326.069 | NA | NA |
kable(aov_g6_read_summary, caption = "ANOVA Summary for Read Scaled Scores by Class Type (Grade 6) in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| cmpstype | 2 | 8735.97 | 4367.985 | 3.517853 | 0.0305551 |
| Residuals | 413 | 512806.51 | 1241.662 | NA | NA |
kable(aov_g7_read_summary, caption = "ANOVA Summary for Read Scaled Scores by Class Type (Grade 7) in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| cmpstype | 2 | 4076.161 | 2038.081 | 1.721074 | 0.1801512 |
| Residuals | 414 | 490255.019 | 1184.191 | NA | NA |
kable(aov_g8_read_summary, caption = "ANOVA Summary for Read Scaled Scores by Class Type (Grade 8) in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| cmpstype | 2 | 2068.782 | 1034.391 | 0.7731823 | 0.4622073 |
| Residuals | 414 | 553864.086 | 1337.836 | NA | NA |
# Line plot of mean read scores by class type for grades 4-8
read_scores_by_grade <- subsetted_data %>%
group_by(cmpstype) %>%
summarise(mean_g4_read = mean(g4treadss, na.rm = TRUE),
mean_g5_read = mean(g5treadss, na.rm = TRUE),
mean_g6_read = mean(g6treadss, na.rm = TRUE),
mean_g7_read = mean(g7treadss, na.rm = TRUE),
mean_g8_read = mean(g8treadss, na.rm = TRUE))
read_scores_by_grade_long <- read_scores_by_grade %>%
pivot_longer(cols = starts_with("mean"), names_to = "Grade", values_to = "Mean_Read_Score")
read_scores_plot1 <- ggplot(read_scores_by_grade_long, aes(x = cmpstype, y = Mean_Read_Score, color = Grade)) +
geom_point(size = 3) +
geom_line(aes(group = Grade)) +
labs(title = "Mean Read Scaled Scores by Class Type (Grades 4-8)",
x = "Class Type",
y = "Mean Read Scaled Score") +
theme_minimal()
read_scores_by_grade <- subsetted_data %>%
group_by(cmpstype) %>%
summarise(mean_g4_read = mean(g4treadss, na.rm = TRUE),
mean_g5_read = mean(g5treadss, na.rm = TRUE),
mean_g6_read = mean(g6treadss, na.rm = TRUE),
mean_g7_read = mean(g7treadss, na.rm = TRUE),
mean_g8_read = mean(g8treadss, na.rm = TRUE)) %>%
pivot_longer(cols = starts_with("mean"),
names_to = "Grade",
values_to = "Mean_Read_Score") %>%
mutate(Grade = factor(Grade,
levels = c("mean_g4_read", "mean_g5_read", "mean_g6_read", "mean_g7_read", "mean_g8_read"),
labels = c("4", "5", "6", "7", "8")))
read_scores_plot2 <- ggplot(read_scores_by_grade, aes(x = Grade, y = Mean_Read_Score, group = cmpstype, color = cmpstype)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3) +
labs(x = "Grade",
y = "Mean Read Scaled Score",
color = "Class Type") +
theme_minimal()
read_scores_plot1 + read_scores_plot2
Unlike the math scaled scores, reading scaled scores show significant class-type effects in grades 4 and 6, though not consistently across all grades. We run Tukey’s HSD post-hoc tests to identify which class types differ significantly in reading scores in those two grades.
# Tukey's HSD post-hoc test for reading scaled scores by class type (grades 4 and 6)
aov_g4_read_tukey_results <- TukeyHSD(aov_g4_read)
aov_g6_read_tukey_results <- TukeyHSD(aov_g6_read)
kable(tidy(aov_g4_read_tukey_results),
caption = "Tukey's HSD Post-hoc Test for Read Scaled Scores by Class Type (Grade 4) in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | contrast | null.value | estimate | conf.low | conf.high | adj.p.value |
|---|---|---|---|---|---|---|
| cmpstype | REGULAR-SMALL | 0 | -14.642573 | -26.411027 | -2.874118 | 0.0100981 |
| cmpstype | AIDE-SMALL | 0 | -10.838272 | -19.372575 | -2.303968 | 0.0083746 |
| cmpstype | AIDE-REGULAR | 0 | 3.804301 | -7.800143 | 15.408746 | 0.7209176 |
kable(tidy(aov_g6_read_tukey_results),
caption = "Tukey's HSD Post-hoc Test for Read Scaled Scores by Class Type (Grade 6) in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | contrast | null.value | estimate | conf.low | conf.high | adj.p.value |
|---|---|---|---|---|---|---|
| cmpstype | REGULAR-SMALL | 0 | -4.513128 | -16.63037 | 7.604118 | 0.6557036 |
| cmpstype | AIDE-SMALL | 0 | -9.973412 | -18.83193 | -1.114891 | 0.0228173 |
| cmpstype | AIDE-REGULAR | 0 | -5.460284 | -17.41967 | 6.499102 | 0.5308003 |
aov_g4_read_tukey_df <- as.data.frame(aov_g4_read_tukey_results$cmpstype)
aov_g4_read_tukey_df$comparison <- rownames(aov_g4_read_tukey_df)
colnames(aov_g4_read_tukey_df) <- c("Difference", "Lower_CI", "Upper_CI", "p_value", "Comparison")
g4_read_tukey <- ggplot(aov_g4_read_tukey_df, aes(x = Comparison, y = Difference)) +
geom_point(aes(color = p_value < 0.05), size = 3) +
geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(title = "Tukey's HSD: Reading Scaled Scores by Class Type (Grades 4 & 6)",
x = "Class Type Comparison",
y = "Mean Read Scaled Scores Diff. (Gr 4)") +
# Hide legend
guides(color = "none") +
coord_flip() +
theme_minimal()
aov_g6_read_tukey_df <- as.data.frame(aov_g6_read_tukey_results$cmpstype)
aov_g6_read_tukey_df$comparison <- rownames(aov_g6_read_tukey_df)
colnames(aov_g6_read_tukey_df) <- c("Difference", "Lower_CI", "Upper_CI", "p_value", "Comparison")
g6_read_tukey <- ggplot(aov_g6_read_tukey_df, aes(x = Comparison, y = Difference)) +
geom_point(aes(color = p_value < 0.05), size = 3) +
geom_errorbar(aes(ymin = Lower_CI, ymax = Upper_CI), width = 0.2) +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
labs(y = "Mean Read Scaled Scores Diff. (Gr 6)",
# No x-axis
x = NULL) +
coord_flip() +
theme_minimal()
g4_read_tukey + g6_read_tukey
We can see that in grade 4, students in small classes have significantly higher reading scores compared to students in regular and regular-aide classes. In grade 6, students in small classes have significantly higher reading scores compared to students in regular-aide classes, but not compared to students in regular classes.
We find no significant effect of K-3 class type on math scaled scores in grades 4-8, but significant effects on reading scaled scores in grades 4 and 6. While difficult to generalize across all grade levels, the results suggest that K-3 class type may have a lasting impact on reading achievement in certain grades.
We now move further along students’ academic careers, focusing on high school GPA, and investigate the impact of K-3 class type on this outcome.
# histogram of high school GPA
gpa_hist <- ggplot(subsetted_data, aes(x = hsgpaoverall)) +
geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
labs(title = "Distribution of High School GPA",
x = "High School GPA",
y = "Frequency") +
theme_minimal()
# Side by side boxplots of high school GPA by class type (`cmpstype`)
gpa_boxplot <- ggplot(subsetted_data, aes(x = cmpstype, y = hsgpaoverall)) +
geom_boxplot(fill = "skyblue",
color = "black") +
geom_jitter(width = 0.2, color = "navy", alpha = 0.5) +
labs(title = "High School GPA by Class Type",
x = "Class Type",
y = "High School GPA") +
theme_minimal()
gpa_hist + gpa_boxplot
We also employ a one-way ANOVA to test the null hypothesis that the mean high school GPA is the same across different class types. The alternative hypothesis is that at least one class type has a different mean high school GPA.
The model is as follows:
\[\begin{align*} Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \end{align*}\]
where:

- \(Y_{ij}\) is the overall high school GPA of the \(j\)-th student in K-3 class type \(i\),
- \(\mu\) is the overall mean GPA,
- \(\alpha_i\) is the effect of K-3 class type \(i\), and
- \(\epsilon_{ij}\) is the random error term.

The independence, normality, and homoscedasticity assumptions on \(\epsilon_{ij}\) are required for the validity of the ANOVA results.
# One-way ANOVA model for high school GPA by class type
aov_gpa <- aov(hsgpaoverall ~ cmpstype, data = subsetted_data)
# Summary of ANOVA
aov_gpa_summary <- broom::tidy(aov_gpa)
kable(aov_gpa_summary, caption = "ANOVA Summary for High School GPA by Class Type in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| cmpstype | 2 | 43.27034 | 21.63517 | 0.5579172 | 0.5728504 |
| Residuals | 395 | 15317.49209 | 38.77846 | NA | NA |
The one-way ANOVA results show that the class type does not have a significant effect on high school GPA in this new population of interest.
We create a new column college_readiness that combines the ACT composite and the SAT composite converted to the ACT scale. If a student took both exams, we use the maximum of the two scores. This combined score serves as a proxy for college readiness.
subsetted_data <- subsetted_data %>%
mutate(college_readiness = pmax(hsactcomp, hsactconverted, na.rm = TRUE))
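pmax() with na.rm = TRUE takes the element-wise maximum and falls back to the non-missing value when only one score is present; a student missing both scores remains NA. A quick illustration with hypothetical scores:

```r
act <- c(24, NA, 30, NA)  # hypothetical ACT composites
sat <- c(NA, 22, 28, NA)  # hypothetical SAT scores converted to the ACT scale

# Element-wise maximum, ignoring NAs unless both values are missing
pmax(act, sat, na.rm = TRUE)
# -> 24 22 30 NA
```

This is exactly the behavior we want for the proxy: a single exam score is used as-is, and only students with neither exam remain missing.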
Similar to the previous analyses, we will investigate the impact of K-3 class type on college readiness scores using a one-way ANOVA.
# Histogram of college readiness scores
readiness_hist <- ggplot(subsetted_data, aes(x = college_readiness)) +
geom_histogram(binwidth = 1, fill = "skyblue", color = "black") +
labs(title = "Distribution of College Readiness Scores",
x = "College Readiness Score",
y = "Frequency") +
theme_minimal()
# Side by side boxplots of college readiness scores by class type (`cmpstype`)
readiness_boxplot <- ggplot(subsetted_data, aes(x = cmpstype, y = college_readiness)) +
geom_boxplot(fill = "skyblue",
color = "black") +
geom_jitter(width = 0.2, color = "navy", alpha = 0.5) +
labs(title = "College Readiness Scores by Class Type",
x = "Class Type",
y = "College Readiness Score") +
theme_minimal()
readiness_hist + readiness_boxplot
The one-way ANOVA model for college readiness scores by class type is as follows:
\[\begin{align*} Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \end{align*}\]
where:

- \(Y_{ij}\) is the college readiness score (on the ACT scale) of the \(j\)-th student in K-3 class type \(i\),
- \(\mu\) is the overall mean score,
- \(\alpha_i\) is the effect of K-3 class type \(i\), and
- \(\epsilon_{ij}\) is the random error term.

The independence, normality, and homoscedasticity assumptions on \(\epsilon_{ij}\) are required for the validity of the ANOVA results.
# One-way ANOVA model for college readiness scores by class type
aov_readiness <- aov(college_readiness ~ cmpstype, data = subsetted_data)
# Summary of ANOVA
aov_readiness_summary <- broom::tidy(aov_readiness)
kable(aov_readiness_summary, caption = "ANOVA Summary for College Readiness Scores by Class Type in New Population of Interest") %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, bold = TRUE)
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| cmpstype | 2 | 35.1403 | 17.57015 | 0.9244183 | 0.3975795 |
| Residuals | 414 | 7868.7782 | 19.00671 | NA | NA |
Again, the one-way ANOVA results show that the K-3 class type does not have a significant effect on college readiness scores in this new population of interest.
Immediate (Very Short-Term) Effect of Class Size on Academic Performance: Consistent with the initial analysis, students in small classes have significantly higher grade 1 math and reading scores than students in regular and regular-aide classes.
Impact of Class Size on Grades 4-8 Achievement: No significant effect of K-3 class type on math scaled scores in grades 4-8 was observed. However, significant effects of class type on reading scaled scores were observed in grades 4 and 6. Due to the inconsistency across grades, generalizing the effect of K-3 class type on reading achievement is challenging.
Impact of Class Size on High School Achievement: K-3 class type has no significant effect on high school GPA or college readiness scores in this new population of interest.
The extended research question was:
For students who complete both primary and secondary education with the objective of pursuing higher education (college), does the exposure to small class sizes in early education (K-3) have a significant impact on their high school academic performance and college readiness?
Based on the supporting analyses conducted, we can conclude that the effects of class size on student achievement do not persist beyond the early grades (K-3) to significantly influence high school GPA or college readiness in this new population of interest. While there were some significant effects on reading scores in grades 4 and 6, the overall impact on high school GPA and college readiness was not significant.
The results suggest that while small class sizes may have immediate short-term benefits on academic performance, these effects do not appear to have a lasting impact on high school academic achievement or college readiness. Other factors, such as individual student characteristics, school resources, and teaching quality, may play a more significant role in determining high school GPA and college readiness. These are interesting areas that may be more fit for an observational study to explore the complex interplay of various factors on long-term academic outcomes.
Ideally, we’d like to draw causal conclusions about the impact of K-3 class type on student achievement. For causal inference to be valid, the following assumptions must be satisfied:
Random Assignment (Ignorability): Class-type assignment must be independent of potential outcomes. This holds by design, since Project STAR randomized students and teachers to class types.
Stable Unit Treatment Value Assumption (SUTVA): One student's outcome is not affected by the class-type assignments of other students, and each treatment has a single well-defined version.
Consistency: A student's observed outcome under their assigned class type equals the corresponding potential outcome; the class types were clearly defined (small: 13–17 students; regular: 22–25 students, with or without a full-time aide).
Positivity: Every student had a nonzero probability of being assigned to each class type, which randomization guarantees.
Given the robustness of Project STAR’s experimental design, we can reasonably interpret observed effects, particularly immediate short-term impacts, as causal. The significant short-term effect of smaller class sizes on student math achievement is credible.
However, for the longer-term effects (grades 4–8 and beyond), causal interpretations become less straightforward due to:
Attrition: the longitudinal sample is restricted to students with complete records, and dropout may be related to both class type and later achievement.
Post-treatment factors: after grade 3, students were exposed to different schools, teachers, and resources that the original randomization no longer controls.
In summary, while the assumptions for causal inference are largely satisfied, particularly in the short-term analysis due to the randomized controlled trial design, cautious interpretation is advised for long-term causal claims. Immediate effects of small class sizes can be confidently viewed as causal, but long-term academic outcomes should be interpreted with consideration of potential attrition biases and external educational factors.
The following section presents diagnostic plots for all the ANOVA models fitted in the supporting analyses. These plots help assess the validity of the ANOVA results by checking the independence, normality, and homoscedasticity assumptions.
# Model Diagnostics
# Set graphical parameters: 3x2 grid of panels with tighter margins
par(mfrow = c(3, 2), mar = c(4, 4, 2, 1))

# List of all ANOVA models fitted in the supporting analyses
anova_models <- list(
  "Grade 1 Math" = aov1,
  "Grade 4 Math" = aov_g4_math,
  "Grade 5 Math" = aov_g5_math,
  "Grade 6 Math" = aov_g6_math,
  "Grade 7 Math" = aov_g7_math,
  "Grade 8 Math" = aov_g8_math,
  "Grade 4 Reading" = aov_g4_read,
  "Grade 5 Reading" = aov_g5_read,
  "Grade 6 Reading" = aov_g6_read,
  "Grade 7 Reading" = aov_g7_read,
  "Grade 8 Reading" = aov_g8_read,
  "High School GPA" = aov_gpa,
  "College Readiness" = aov_readiness
)

# Loop through models and generate diagnostic plots
for (model_name in names(anova_models)) {
  model <- anova_models[[model_name]]

  # Residuals vs. fitted: checks homoscedasticity
  plot(model$fitted.values, resid(model),
       main = paste(model_name, "- Residuals vs Fitted"),
       xlab = "Fitted Values", ylab = "Residuals", pch = 19,
       col = rgb(0, 0, 1, alpha = 0.5))
  abline(h = 0, col = "red", lwd = 1.5, lty = 2)

  # Normal Q-Q plot: checks normality of residuals
  qqnorm(resid(model), main = paste(model_name, "- Normal Q-Q"), pch = 19,
         col = rgb(0, 0.5, 0, alpha = 0.5))
  qqline(resid(model), col = "red", lwd = 1.5, lty = 2)
}

# Reset graphical parameters
par(mfrow = c(1, 1))
Based on these plots, not all models satisfy the normality and homoscedasticity assumptions. This further emphasizes the need for caution in interpreting the ANOVA results, particularly for the long-term effects.
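The visual impression from the Q-Q plots can be quantified with a Shapiro-Wilk test on each model's residuals. This is a sketch that reuses the `anova_models` list defined in the diagnostics chunk above; small p-values indicate departures from normality.

```r
# Shapiro-Wilk normality test applied to the residuals of each ANOVA model;
# applicable here since each sample is within the test's 3-5000 size limit
normality_pvals <- sapply(anova_models, function(m) shapiro.test(resid(m))$p.value)
round(normality_pvals, 4)
```

Models with p-values well below 0.05 are the ones whose F-tests most warrant the cautious interpretation described above.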
I acknowledge and thank my classmates Hangyu Li and Shang Chen for their helpful discussions and for sharing unique approaches to the analysis.
Achilles, C. M., Bain, H. P., Bellott, F., Boyd-Zaharias, J., Finn, J., Folger, J., Johnston, J., & Word, E. (2008). Tennessee’s Student Teacher Achievement Ratio (STAR) project [Data set]. Harvard Dataverse. https://doi.org/10.7910/DVN/SIWH9F
Achilles, C. M., Bain, H. P., Bellott, F., Boyd-Zaharias, J., Finn, J., Folger, J., Johnston, J., & Word, E. (2008). Tennessee’s Student Teacher Achievement Ratio (STAR) project [Data set]. Harvard Dataverse, V1: Project STAR K-3. https://doi.org/10.7910/DVN/SIWH9F
Achilles, C. M., Bain, H. P., Bellott, F., Boyd-Zaharias, J., Finn, J., Folger, J., Johnston, J., & Word, E. (2008). Tennessee’s Student Teacher Achievement Ratio (STAR) project [Data set]. Harvard Dataverse, V1: STAR Follow-up Studies (1996–1997) Report. https://doi.org/10.7910/DVN/SIWH9F
Achilles, C. M., Bain, H. P., Bellott, F., Boyd-Zaharias, J., Finn, J., Folger, J., Johnston, J., & Word, E. (2008). Tennessee’s Student Teacher Achievement Ratio (STAR) project [Data set]. Harvard Dataverse, V1: starUsersGuide.pdf. https://doi.org/10.7910/DVN/SIWH9F
Hernán, M. A., & Robins, J. M. (2025). Causal inference: What if (1st ed.). Chapman & Hall/CRC. https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS 15.3.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plotly_4.10.4 ggalluvial_0.12.5 tidyr_1.3.1 forcats_1.0.0
## [5] foreign_0.8-86 AER_1.2-14 survival_3.6-4 sandwich_3.1-1
## [9] lmtest_0.9-40 zoo_1.8-12 broom_1.0.7 MASS_7.3-60.2
## [13] car_3.1-3 carData_3.0-5 patchwork_1.3.0 kableExtra_1.4.0
## [17] knitr_1.48 ggthemes_5.1.0 ggplot2_3.5.1 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.5 xfun_0.51 bslib_0.8.0 htmlwidgets_1.6.4
## [5] lattice_0.22-6 crosstalk_1.2.1 vctrs_0.6.5 tools_4.4.1
## [9] generics_0.1.3 tibble_3.2.1 fansi_1.0.6 highr_0.11
## [13] pkgconfig_2.0.3 Matrix_1.7-0 data.table_1.16.4 lifecycle_1.0.4
## [17] compiler_4.4.1 farver_2.1.2 stringr_1.5.1 munsell_0.5.1
## [21] htmltools_0.5.8.1 sass_0.4.9 lazyeval_0.2.2 yaml_2.3.10
## [25] Formula_1.2-5 pillar_1.9.0 jquerylib_0.1.4 cachem_1.1.0
## [29] abind_1.4-8 tidyselect_1.2.1 digest_0.6.37 stringi_1.8.4
## [33] purrr_1.0.2 labeling_0.4.3 splines_4.4.1 fastmap_1.2.0
## [37] grid_4.4.1 colorspace_2.1-1 cli_3.6.3 magrittr_2.0.3
## [41] utf8_1.2.4 withr_3.0.1 scales_1.3.0 backports_1.5.0
## [45] httr_1.4.7 rmarkdown_2.28 evaluate_1.0.0 viridisLite_0.4.2
## [49] rlang_1.1.4 glue_1.8.0 xml2_1.3.6 svglite_2.1.3
## [53] rstudioapi_0.17.1 jsonlite_1.8.9 R6_2.5.1 systemfonts_1.2.1